Contrasting State-of-the-Art Automated Scoring of Essays: Analysis

Authors

  • Mark D. Shermis
  • Ben Hamner
Abstract

This study compared the results from nine automated essay scoring engines on eight essay scoring prompts drawn from six states that annually administer high-stakes writing assessments. Student essays from each state were randomly divided into three sets: a training set (used for modeling the essay prompt responses and consisting of text and ratings from two human raters along with a final or resolved score), a test set used for a blind test of the vendor-developed models (consisting of text responses only), and a validation set that was not employed in this study. The essays encompassed writing assessment items from three grade levels (7, 8, 10) and were evenly divided between source-based prompts (i.e., essay prompts developed on the basis of provided source material) and those drawn from traditional writing genres (i.e., narrative, descriptive, persuasive). The total sample size was N = 22,029. Six of the eight essay sets were transcribed from the original handwritten responses by two transcription vendors; transcription accuracy was computed at 98.70% across 17,502 essays. The remaining essays were typed by students during the actual assessment and provided in ASCII form. Seven of the eight essays were holistically scored and one employed score assignments for two traits. Scale ranges, rubrics, and scoring adjudications for the essay sets were quite variable. Results are presented on the distributional properties of the data (mean and standard deviation) along with traditional measures used in automated essay scoring: exact agreement, exact+adjacent agreement, kappa, quadratic-weighted kappa, and the Pearson r. The results demonstrated that, overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items, with equal performance for both source-based and traditional writing genres. Because this study incorporated already existing data (and the limitations associated with them), it is highly likely that the estimates provided represent a floor for what automated essay scoring can do under operational conditions.

[Figure: A Wordle for Essay Set #1 used in this study (Source: www.wordle.net)]

Introduction

With the press for developing innovative assessments that can accommodate the higher-order thinking and performances associated with the Common Core Standards, there is a need to systematically evaluate the benefits and features of automated essay scoring (AES). While the developers of AES engines have published an impressive body of literature suggesting that the measurement technology can produce reliable and valid essay scores when compared with trained human raters (Attali & Burstein, 2006; Shermis, Burstein, Higgins, & Zechner, 2010), comparisons across the multiple platforms have been informal, have involved less-than-ideal sample essays, and were often associated with an incomplete criterion set. The purpose of this paper is to present the results of a comparison of nine AES engines on responses to a range of prompts from multiple grades, with some prompts targeting content and others relatively content-free. The AES engines were compared on the basis of scores from independent raters using state-developed writing assessment rubrics.
This study was shaped by the two major consortia associated with implementing the Common Core Standards, the Partnership for Assessment of Readiness for College and Careers (PARCC) and the SMARTER Balanced Assessment Consortium (SMARTER Balanced), as part of their investigation into the viability of using AES for their new generation of assessments. Comprehensive independent studies of automated essay scoring platforms have been rare. Rudner, Garcia, & Welch (2006) conducted a two-part independent evaluation of one vendor of automated essay scoring, IntelliMetric by Vantage Learning. After reviewing data drawn from the Graduate Management Admission Test™, the investigators concluded, "the IntelliMetric system is a consistent, reliable system for scoring AWA (Analytic Writing Assessment) essays." While such individual endorsements are heartening, there has yet to be a comprehensive look at machine scoring technology. This is particularly important as the assessments for the Common Core Standards are under development. In part, this vendor demonstration was designed to evaluate the degree to which current high-stakes writing assessments, and those envisioned under the Common Core Standards, might be scored through automated methods.

This study was the first phase of a three-part evaluation. Phase I examines machine scoring capabilities for extended-response essays and consists of two parts. The first part reports on the capabilities of already existing commercial machine scoring systems. Running concurrently with the vendor demonstration is a public competition in which the study sponsor (The William and Flora Hewlett Foundation) provides cash prizes for newly developed scoring engines created by individuals or teams. Phase II will do the same for short-answer constructed responses, followed by an evaluation of math items (i.e., proofs, graphs, formulas) in Phase III.

Participants

Student essays (N = 22,029) were collected for eight different prompts representing six PARCC and SMARTER Balanced states (three PARCC states and three SMARTER Balanced states). To the extent possible, an attempt was made to keep the identity of the participating states anonymous. Three of the states were located in the Northeast of the U.S., two in the Midwest, and one on the West Coast. Because no demographic information was provided by the states, student characteristics were estimated from a number of different sources, as displayed in Table 1. Student writers were drawn from three grade levels (7, 8, 10), were ethnically diverse, and were evenly distributed between males and females; the grade-level selection was generally a function of the testing policies of the participating states (e.g., a writing component as part of a 10th grade exit exam).

Samples ranging in size from 1,527 to 3,006 were randomly selected from the data sets provided by the states, and then randomly divided into three sets: a training set, a test set, and a validation set. The training set was used by the vendors to create their scoring models, and consisted of scores assigned by at least two human raters, a final or adjudicated score, and the text of the essay. The test set consisted of essay text only and was used as part of a blind test of the score model predictions. The purpose of the third set, the validation set, which served as a second test set, was to calculate scoring engine performance for a public competition that was launched at approximately the same time as the vendor demonstration.
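The partitioning just described is a standard random three-way split. The sketch below shows one way it could be reproduced in Python, assuming essays are held as simple records and using the roughly 60/20/20 proportions reported below; the record fields and function name are illustrative inventions, not the study's actual data-preparation code.

```python
import random

def split_essays(essays, seed=42, train_frac=0.60, test_frac=0.20):
    """Randomly partition essay records into training, test, and validation sets.

    `essays` is assumed to be a list of dicts such as
    {"id": 17, "text": "...", "rater1": 3, "rater2": 4, "resolved": 4}.
    The remaining fraction (here 20%) goes to the validation set.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = essays[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Example: 2,000 placeholder essays split roughly 60/20/20.
if __name__ == "__main__":
    fake_essays = [{"id": i, "text": "", "resolved": 3} for i in range(2000)]
    train, test, validation = split_essays(fake_essays)
    print(len(train), len(test), len(validation))   # 1200 400 400
```

In the study itself, only the training set retained the human ratings; the test and second test (validation) sets were distributed as essay text only.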
The second test set was also to be used as a test set for any of the commercial vendors who might subsequently elect to participate in the public competition; it consisted of essay text only. The samples were split in the following proportions: 60% training sample, 20% test sample, and 20% second test sample. The actual proportions vary slightly due to the elimination of cases containing either data errors or text anomalies. The distribution of the samples is displayed in Table 1.

Instruments

Four of the essays were drawn from traditional writing genres (persuasive, expository, narrative) and four essays were "source-based"; that is, the questions asked in the prompt referred to a source document that students read as part of the assessment. Appendix A lists the prompts, scoring rubrics for raters, adjudication guidelines, and reading material for the source-based essays. In the training set, average essay lengths (in words) varied from M = 94.39 (SD = 51.68) to M = 622.24 (SD = 197.08). Traditional essays were significantly longer (M = 354.18, SD = 197.63) than source-based essays (M = 119.97, SD = 58.88; t(13334) = 95.18, p < .05).

Five of the prompts employed a holistic scoring rubric, one prompt was scored with a two-trait rubric, and two prompts were scored with a multi-trait rubric but reported as a holistic score. The type of rubric, scale ranges, and scale means and standard deviations are reported in Tables 2 and 3. Table 2 shows the characteristics of the training set and Table 3 shows the characteristics of the test set. Human rater agreement information is reported in Tables 2 and 3 with associated data for exact agreement, exact+adjacent agreement, kappa, Pearson r, and quadratic-weighted kappa. Quadratic-weighted kappas ranged from 0.66 to 0.85, a typical range for human rater performance in statewide high-stakes testing programs.

Procedure

Six of the essay sets were transcribed from their original paper-form administration in order to prepare them for processing by automated essay scoring engines. At a minimum, the scoring engines require the essays to be in ASCII format. This process involved retrieving the scanned copies of essays from the state or a vendor serving the state, randomly selecting a sample of essays for inclusion in the study, and then sending the selected documents out for transcription. Both the scanning and transcription steps had the potential to introduce errors into the data that would have been minimized had the essays been typed directly into the computer by the students, the normal procedure for automated essay scoring. Essays were scanned on high-quality digital scanners, but occasionally student writing was illegible because the original paper document was written with an instrument that was too light to reproduce well, was smudged, or included handwriting that was undecipherable. In such cases, or if the essay could not be scored by human raters (i.e., the essay was off-topic or inappropriate), the essay was eliminated from the analyses. Transcribers were instructed to be as faithful to the written document as possible, while keeping in mind the extended capabilities students would have had if they had composed on a computer. For example, more than a few students hand-wrote their essays in a print style in which all letters were capitalized. To address this challenge, we instructed the transcribers to capitalize the beginnings of sentences, proper names, and so on.
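The human rater agreement statistics reported above for Tables 2 and 3 (exact agreement, exact+adjacent agreement, kappa, Pearson r, and quadratic-weighted kappa) are the same indices used later to compare the engines' scores against the human scores. For reference, the sketch below computes exact and adjacent agreement, the Pearson r, and quadratic-weighted kappa for two vectors of integer ratings using their standard definitions; it is an illustrative Python implementation, not the analysis code used in the study.

```python
from collections import Counter
import math

def quadratic_weighted_kappa(r1, r2, min_score, max_score):
    """Quadratic-weighted kappa between two lists of integer ratings."""
    categories = list(range(min_score, max_score + 1))
    k = len(categories)
    n = len(r1)

    # Observed agreement matrix (counts of rater1 x rater2 score pairs).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        observed[a - min_score][b - min_score] += 1

    # Expected matrix from the two raters' marginal distributions.
    hist1, hist2 = Counter(r1), Counter(r2)
    expected = [[hist1[categories[i]] * hist2[categories[j]] / n
                 for j in range(k)] for i in range(k)]

    # Disagreements are weighted by the squared distance on the score scale.
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)
            num += w * observed[i][j]
            den += w * expected[i][j]
    return 1.0 - num / den

def exact_and_adjacent_agreement(r1, r2):
    """Proportion of identical scores and of scores within one point."""
    n = len(r1)
    exact = sum(a == b for a, b in zip(r1, r2)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(r1, r2)) / n
    return exact, adjacent

def pearson_r(r1, r2):
    """Pearson product-moment correlation between the two score vectors."""
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    var1 = sum((a - m1) ** 2 for a in r1)
    var2 = sum((b - m2) ** 2 for b in r2)
    return cov / math.sqrt(var1 * var2)

# Toy example with two hypothetical raters on a 1-6 scale.
rater1 = [3, 4, 2, 5, 3, 4, 1, 6, 4, 3]
rater2 = [3, 4, 3, 5, 2, 4, 2, 5, 4, 3]
print(exact_and_adjacent_agreement(rater1, rater2))   # exact = 0.6, adjacent = 1.0
print(round(pearson_r(rater1, rater2), 3))
print(round(quadratic_weighted_kappa(rater1, rater2, 1, 6), 3))
```

Quadratic-weighted kappa penalizes disagreements by the square of their distance on the score scale, which is why it is the headline statistic in essay scoring, where adjacent scores are far less serious than distant ones.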
This capitalization convention may have corrected some errors that students would otherwise have made, but it also limited the over-identification of capitalization errors by the automated essay scoring engines.

The first transcription company serviced four prompts from three states, comprising 11,496 essays. In order to assess the potential impact of transcription errors, a random sample of 588 essays was re-transcribed and compared on the basis of punctuation, capitalization, misspellings, and skipped data. Accuracy was calculated on the basis of the number of characters and the number of words, with an average rate of 98.12%. The second transcription company was evaluated using similar metrics. From a pool of 6,006 essays, a random sample of 300 essays was selected for re-transcription; accuracy for this set of essays was calculated to be 99.82%.

Two of the essay sets were provided in ASCII format by their respective states. The 10th grade students in those states had typed their responses directly into the computer using web-based software that emulated a basic word processor. Except that the test had been administered by computer, the conditions for testing were similar to those in states where the essays had been transcribed. One key challenge for both sets of data, those that were transcribed and those that were directly typed, was that carriage returns and paragraph-formatting meta-tags were missing from the ASCII text. For some of the scoring engines, this omission could have been a significant impediment to the engine's ability to accurately evaluate the underlying structure of the writing, one component in their statistical prediction models. Other than asking each student to retype their original answers into the data sets, there was no way to ameliorate this.

Vendors were provided a training set for each of the eight essay prompts. Up to four weeks were allowed to statistically model the data during the "training" phase of the demonstration. In addition, vendors were provided with cut-score information along with any scoring guides that were used in the training of human raters. This supplemental information was employed by some of the vendors to better model score differences for the score points along the state rubric continuum. Two of the essay prompts used trait rubrics to formulate a holistic score by summing some or all of the trait scores. For these two essays, both the holistic and trait scores were provided to the vendors.

During the training period, a series of conference calls, with detailed questions and answers, was conducted to clarify the nature of the data sets or to address data problems that arose while modeling the data. For example, the guidelines from one state indicated that the final or resolved score was to be the higher of the two rater scores, but in several cases this was not so. Rather than modify the resolved score, the vendors were instructed to use it in their prediction models even though it was apparently inconsistent with the state's guidelines. This operational decision had the potential to negatively impact the scoring engines' reported capacity to adequately model what the state was trying to accomplish in assigning scores to essays. However, the decision was acceptable, and it arguably added to the robustness of whatever results were obtained, since the study was designed to test how vendors would perform when applying their scoring engines to state-generated data under pragmatic conditions. Stated somewhat differently, retaining these inconsistencies provided a representation of the typical contextual conditions within which the scoring engines were actually employed.
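None of the commercial vendors discloses the features or model form of its engine, so their training step cannot be reproduced here. Purely to illustrate the train-then-blind-predict workflow just described, the sketch below fits a deliberately crude surface-feature baseline (essay length and vocabulary diversity) to the resolved training scores and emits integer predictions clipped to the prompt's scale range; every function name and feature choice is a hypothetical stand-in, not any vendor's method.

```python
import numpy as np

def extract_features(texts):
    """Very crude surface features: log word count and type-token ratio.
    Real engines use far richer linguistic measures; these are placeholders."""
    feats = []
    for t in texts:
        words = t.split()
        n = max(len(words), 1)
        feats.append([np.log(n), len(set(w.lower() for w in words)) / n])
    return np.array(feats)

def fit_baseline(train_texts, resolved_scores):
    """Ordinary least squares from surface features to the resolved score."""
    X = extract_features(train_texts)
    X = np.hstack([np.ones((len(X), 1)), X])          # intercept term
    w, *_ = np.linalg.lstsq(X, np.array(resolved_scores, dtype=float), rcond=None)
    return w

def predict(weights, texts, min_score, max_score):
    """Predict, then round and clip to the prompt's integer scale range,
    mirroring the requirement that vendors submit integer score predictions."""
    X = extract_features(texts)
    X = np.hstack([np.ones((len(X), 1)), X])
    raw = X @ weights
    return np.clip(np.rint(raw), min_score, max_score).astype(int)

# Toy usage on a 1-6 holistic scale.
train_texts = ["short answer",
               "a somewhat longer response with more detail and variety",
               "an essay " * 40,
               "another brief reply"]
train_scores = [2, 4, 5, 2]
w = fit_baseline(train_texts, train_scores)
print(predict(w, ["a new unseen essay of moderate length and varied words"], 1, 6))
```

A real engine would replace `extract_features` with hundreds of linguistic, syntactic, and semantic measures and a more sophisticated statistical model; the point here is only the workflow of modeling the training set and then producing integer scores for unseen text.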
In the "test" phase of the evaluation, vendors were provided data sets that had only the text of the essays associated with them, and were asked to make integer score predictions for each essay. They were given a 59-hour period in which to make their predictions and were permitted to eliminate up to 2% of the essay score predictions in each data set in case their scoring engine classified the essay as "unscorable". Even though human raters had successfully rated all the essays in the test set, there were a variety of reasons that any one essay might prove problematic for machine scoring. For example, an essay might have addressed the prompt in a unique enough way to receive a low human score, but be deemed "off topic" for machine scoring. In real-life situations, provisions would be made for these essays to be scored by human raters.

Procedure: Scoring Engines

Eight of the nine automated essay scoring engines evaluated in the demonstration represented commercial entities and captured over 97% of the current automated scoring market in the United States. The lone non-commercial scoring engine was invited into the demonstration because it was an already existing open-source package that was publicly available on its developers' web site. Below are short descriptions of each engine. A more extensive description can be found in a report to the William and Flora Hewlett Foundation.

AutoScore, American Institutes for Research (AIR)

AutoScore is an essay scoring engine developed by the American Institutes for Research. The engine is designed to create a statistical proxy for prompt-specific rubrics. The rubrics may be single- or multiple-trait rubrics. A training set, including known, valid scores, is required to train the engine. The system takes a series of measures on each essay in the remaining training set. These ...


Similar Articles

Essay Assessment with Latent Semantic Analysis

Latent semantic analysis (LSA) is an automated, statistical technique for comparing the semantic similarity of words or documents. In this paper, I examine the application of LSA to automated essay scoring. I compare LSA methods to earlier statistical methods for assessing essay quality, and critically review contemporary essay-scoring systems built on LSA, including the Intelligent Essay Asses...


SCESS: a WFSA-based automated simplified Chinese essay scoring system with incremental latent semantic analysis

Writing in language tests is regarded as an important indicator for assessing language skills of test takers. As Chinese language tests become popular, scoring a large number of essays becomes a heavy and expensive task for the organizers of these tests. In the past several years, some efforts have been made to develop automated simplified Chinese essay scoring systems, reducing both costs and ...


Robust Trait-Specific Essay Scoring Using Neural Networks and Density Estimators (Ph.D. defence, School of Computing, National University of Singapore)

Traditional automated essay scoring systems rely on carefully designed features to evaluate and score essays. The performance of such systems is tightly bound to the quality of the underlying features. However, it is laborious to manually design the most informative features for such a system. In this thesis, we develop a novel approach based on recurrent neural networks to learn the relation b...


Fine-grained essay scoring of a complex writing task for native speakers

Automatic essay scoring is nowadays successfully used even in high-stakes tests, but this is mainly limited to holistic scoring of learner essays. We present a new dataset of essays written by highly proficient German native speakers that is scored using a fine-grained rubric with the goal to provide detailed feedback. Our experiments with two state-of-the-art scoring systems (a neural and a SV...


Topicality-Based Indices for Essay Scoring

In this paper, we address the problem of quantifying the overall extent to which a test-taker’s essay deals with the topic it is assigned (prompt). We experiment with a number of models for word topicality, and a number of approaches for aggregating word-level indices into text-level ones. All models are evaluated for their ability to predict the holistic quality of essays. We show that the best...



Journal:

Volume, Issue:

Pages:

Publication date: 2012